The activations are binarized following previous works as:
\[
\hat{X}^i_B = \operatorname{Sign}(X^i_R) =
\begin{cases}
-1, & \text{if } X^i_R < 0 \\
+1, & \text{if } X^i_R \geqslant 0
\end{cases}
\tag{5.37}
\]
In that case, $\hat{X}_B^T \hat{X}_B = n_{X_R}$, where $n_{X_R}$ is the number of elements in $X_R$, and $\alpha^*$ can be solved as:
\[
\alpha^* = \frac{X_R^T \hat{X}_B}{n_{X_R}} = \frac{\|X_R\|_{\ell 1}}{n_{X_R}}
\tag{5.38}
\]
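For concreteness, the following PyTorch-style sketch implements Eqs. (5.37)-(5.38); the function name and tensor handling are illustrative assumptions, not the authors' reference code.

```python
import torch

def binarize_sign(x_r: torch.Tensor):
    """Sketch of Eqs. (5.37)-(5.38): {-1, +1} binarization with the
    closed-form scale alpha* = ||X_R||_l1 / n_XR (illustrative only)."""
    # Sign with the convention Sign(0) = +1, as in Eq. (5.37)
    x_hat_b = torch.where(x_r >= 0, torch.ones_like(x_r), -torch.ones_like(x_r))
    # alpha* = ||X_R||_l1 / n_XR, i.e., the mean absolute value of X_R
    alpha = x_r.abs().mean()
    return alpha * x_hat_b, alpha
```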
For the activations in attention layers or after the ReLU non-linearity layers, with $X_R \in \mathbb{R}^n_+$, the authors binarized the activations to $\hat{X}_B \in \{0, 1\}^n$ by rounding the real-valued activations:
\[
\hat{X}^i_B = \left\lfloor \operatorname{Clip}(X^i_R, 0, 1) \right\rceil =
\begin{cases}
0, & \text{if } X^i_R < 0.5 \\
1, & \text{if } X^i_R \geqslant 0.5
\end{cases}
\tag{5.39}
\]
In that case, $\hat{X}_B^T \hat{X}_B = n_{\{X_R \geqslant 0.5\}}$, where $n_{\{X_R \geqslant 0.5\}}$ denotes the number of elements in $X_R$ that are greater than or equal to 0.5. Then $\alpha^*$ can be solved as:
\[
\alpha^* = \frac{\left\| X_R \cdot \mathbb{1}_{\{X_R \geqslant 0.5\}} \right\|_{\ell 1}}{n_{\{X_R \geqslant 0.5\}}}
\tag{5.40}
\]
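A matching sketch for the $\{0, 1\}$ case of Eqs. (5.39)-(5.40) is shown below; the guard against an all-zero mask is an added assumption for numerical safety.

```python
import torch

def binarize_nonneg(x_r: torch.Tensor):
    """Sketch of Eqs. (5.39)-(5.40): {0, 1} binarization for non-negative
    activations such as attention scores or post-ReLU features."""
    # Round-to-nearest after clipping to [0, 1], Eq. (5.39)
    # (torch.round breaks ties to even; negligible for this sketch)
    x_hat_b = torch.round(torch.clamp(x_r, 0.0, 1.0))
    # n_{X_R >= 0.5}: number of elements rounded to 1 (clamp avoids division by 0)
    n_pos = x_hat_b.sum().clamp(min=1.0)
    # Eq. (5.40): l1 norm of the elements at or above 0.5, over their count
    alpha = (x_r * x_hat_b).abs().sum() / n_pos
    return alpha * x_hat_b, alpha
```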
5.10.2 Elastic Binarization Function
The fixed scaling factor and threshold derived above work reasonably well, but they might not be optimal, since they ignore the distribution of the variable being binarized. Ideally, these parameters would be learned during training to minimize the target loss.
When using classical binarization, i.e., $\hat{X}^i_B = \operatorname{Sign}(X^i_R)$, the binary output is independent of the scale of the real-valued input. However, in our case, where $\hat{X}^i_B = \lfloor \operatorname{Clip}(X^i_R, 0, 1) \rceil$, this independence no longer holds. Learning the scaling and threshold parameters, and approximating the gradients precisely in the process, therefore becomes crucial for the final accuracy.
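This scale dependence can be seen with a tiny numerical check (the activation values and scales below are made up for illustration):

```python
import torch

x = torch.tensor([-0.3, 0.2, 0.4, 0.8])   # hypothetical activations
for c in (0.5, 1.0, 4.0):                  # hypothetical input scales
    sign_out = torch.sign(c * x)                           # identical for every c > 0
    round_out = torch.round(torch.clamp(c * x, 0.0, 1.0))  # changes with c
    print(c, sign_out.tolist(), round_out.tolist())
```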
To handle this, the authors proposed the elastic binarization function to learn both the scale $\alpha \in \mathbb{R}^+$ and the threshold $\beta \in \mathbb{R}$:
\[
X^i_B = \alpha \hat{X}^i_B = \alpha \left\lfloor \operatorname{Clip}\!\left( \frac{X^i_R - \beta}{\alpha},\, 0,\, 1 \right) \right\rceil
\tag{5.41}
\]
In this function, $\alpha$ is initialized with $\alpha^*$ from Eq. (5.38) and $\beta$ with 0, and both are trained with gradients from the final loss. To back-propagate the gradients to $\alpha$ through the discretized binarization function, the straight-through estimator (STE) [9] is leveraged, passing the incoming gradients of the round function straight through as its outgoing gradients:
\[
\frac{\partial X^i_B}{\partial \alpha}
= \hat{X}^i_B + \alpha \frac{\partial \hat{X}^i_B}{\partial \alpha}
\overset{\mathrm{STE}}{\approx} \hat{X}^i_B + \alpha \frac{\partial \operatorname{Clip}\!\left( \frac{X^i_R - \beta}{\alpha}, 0, 1 \right)}{\partial \alpha}
=
\begin{cases}
0, & \text{if } X^i_R < \beta \\[4pt]
\dfrac{\beta - X^i_R}{\alpha}, & \text{if } \beta \leqslant X^i_R < \alpha/2 + \beta \\[4pt]
1 - \dfrac{X^i_R - \beta}{\alpha}, & \text{if } \alpha/2 + \beta \leqslant X^i_R < \alpha + \beta \\[4pt]
1, & \text{if } X^i_R \geqslant \alpha + \beta
\end{cases}
\tag{5.42}
\]
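One way to realize Eqs. (5.41)-(5.42) is a custom autograd function, sketched below. Only the $\alpha$ gradient follows Eq. (5.42) directly; the gradients with respect to the input and $\beta$ are filled in with the same clip-range STE as an assumption of this sketch, and $\alpha$, $\beta$ are assumed to be scalar (0-dim) tensors.

```python
import torch

class ElasticBinarization(torch.autograd.Function):
    """Sketch of the elastic binarization function, Eq. (5.41), with the
    STE-based alpha gradient of Eq. (5.42). Illustrative, not reference code."""

    @staticmethod
    def forward(ctx, x_r, alpha, beta):
        t = (x_r - beta) / alpha
        x_hat_b = torch.round(torch.clamp(t, 0.0, 1.0))   # Eq. (5.41)
        ctx.save_for_backward(t, x_hat_b)
        return alpha * x_hat_b

    @staticmethod
    def backward(ctx, grad_out):
        t, x_hat_b = ctx.saved_tensors
        inside = (t >= 0) & (t < 1)          # region where Clip is locally linear
        zeros = torch.zeros_like(t)
        # Eq. (5.42): dX_B/dalpha is x_hat_b - t inside the clip range and
        # x_hat_b (0 or 1) outside it.
        grad_alpha = (grad_out * (x_hat_b + torch.where(inside, -t, zeros))).sum()
        # Assumed STE for the input and beta: pass gradients through only
        # where the clip is not saturated.
        grad_x = grad_out * torch.where(inside, torch.ones_like(t), zeros)
        grad_beta = -grad_x.sum()
        return grad_x, grad_alpha, grad_beta
```

In use, $\alpha$ would be a learnable parameter initialized with $\alpha^*$ from Eq. (5.38) and $\beta$ a learnable parameter initialized to 0, applied as `ElasticBinarization.apply(x_r, alpha, beta)` in a module's forward pass.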